In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.options.display.max_rows = 10

Automatic alignment on labels


In [2]:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data).set_index('country')
countries


Out[2]:
                  area    capital  population
country
Belgium          30510   Brussels        11.3
France          671308      Paris        64.3
Germany         357050     Berlin        81.3
Netherlands      41526  Amsterdam        16.9
United Kingdom  244820     London        64.9

Series alignment

Let's define two series with death and birth rates per 1000 people in 2013 (data from the World Bank):


In [3]:
death_rate = pd.Series([10, 9, 11, 8, 9],
                       index=['Poland','United Kingdom', 'Germany', 'Netherlands', 'France'])
print(death_rate)


Poland            10
United Kingdom     9
Germany           11
Netherlands        8
France             9
dtype: int64

In [4]:
birth_rate = pd.Series([10, 9, 10, 12],
                       index=['Netherlands',  'Germany', 'Poland', 'France'])
print(birth_rate)


Netherlands    10
Germany         9
Poland         10
France         12
dtype: int64

Now we calculate the natural increase by subtracting the death rate from the birth rate:


In [5]:
natural_increase = birth_rate - death_rate
print(natural_increase)


France             3
Germany           -2
Netherlands        2
Poland             0
United Kingdom   NaN
dtype: float64

Note that the rows where the two series did not overlap contain missing values (NaN = Not a Number) and that the data were properly aligned on the index.
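If you prefer a number instead of NaN for the non-overlapping labels, the arithmetic methods (`sub`, `add`, ...) accept a `fill_value` argument. A minimal sketch on the same two series; note that substituting 0 here is purely illustrative, not a meaningful demographic estimate:

```python
import pandas as pd

death_rate = pd.Series([10, 9, 11, 8, 9],
                       index=['Poland', 'United Kingdom', 'Germany',
                              'Netherlands', 'France'])
birth_rate = pd.Series([10, 9, 10, 12],
                       index=['Netherlands', 'Germany', 'Poland', 'France'])

# align on labels as before, but substitute 0 for any missing value,
# so 'United Kingdom' becomes 0 - 9 = -9.0 instead of NaN
increase_filled = birth_rate.sub(death_rate, fill_value=0)
```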


In [6]:
pop_change = pd.DataFrame({'death rate' : death_rate, 
                           'birth rate' : birth_rate,
                           'natural increase' : natural_increase})

Missing values

We can remove the missing data using the dropna method:


In [7]:
pop_change.dropna(inplace=True)

In [8]:
pop_change


Out[8]:
             birth rate  death rate  natural increase
France               12           9                 3
Germany               9          11                -2
Netherlands          10           8                 2
Poland               10          10                 0
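Dropping rows is not the only option: `fillna` replaces missing values instead of discarding the row. A small sketch on a toy frame (the values here are made up, not the population data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan], 'y': [3.0, 4.0]})

filled = df.fillna(0)    # the missing x becomes 0.0, both rows survive
dropped = df.dropna()    # the row with the missing x disappears
```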
EXERCISE: Calculate estimated population in 2014 by summing the population and natural increase (remember that the natural increase is given per 1000 people).

In [ ]:

EXERCISE: Calculate ratio of the highest and lowest estimated population in 2014.

In [ ]:

Joining two data frames

Let's now add the population change data to the countries data frame:


In [11]:
countries.join(pop_change)


Out[11]:
                  area    capital  population  birth rate  death rate  natural increase
country
Belgium          30510   Brussels        11.3         NaN         NaN               NaN
France          671308      Paris        64.3          12           9                 3
Germany         357050     Berlin        81.3           9          11                -2
Netherlands      41526  Amsterdam        16.9          10           8                 2
United Kingdom  244820     London        64.9         NaN         NaN               NaN

There are four different ways we can handle the missing rows:

  • left (default) — take all the rows of the left data frame
  • right — take all the rows of the right data frame
  • inner — take the common rows of both data frames
  • outer — take all the rows present in either or both data frames

Note that these methods work like the corresponding SQL JOIN clauses.
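The four modes can be compared on a pair of toy frames (these frames are invented for illustration, not the tutorial data):

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])
right = pd.DataFrame({'b': [10, 20]}, index=['y', 'z'])

inner = left.join(right, how='inner')   # only the shared label 'y'
outer = left.join(right, how='outer')   # all of 'x', 'y' and 'z'
```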


In [12]:
countries.join(pop_change, how='right')


Out[12]:
               area    capital  population  birth rate  death rate  natural increase
France       671308      Paris        64.3          12           9                 3
Germany      357050     Berlin        81.3           9          11                -2
Netherlands   41526  Amsterdam        16.9          10           8                 2
Poland          NaN        NaN         NaN          10          10                 0
EXERCISE: Try inner and outer join. What's the difference?

In [ ]:

Groupby operations

Some 'theory': the groupby operation (split-apply-combine)

By "group by" we are referring to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure

This is similar to the SQL GROUP BY clause.
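The three steps can be spelled out by hand; `groupby` performs the equivalent in one call. This is a sketch of the idea, not how pandas is implemented internally:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                   'data': [1, 2, 3, 4]})

# split: one sub-frame per distinct key
groups = {key: group for key, group in df.groupby('key')}
# apply: reduce each sub-frame to a single number
sums = {key: group['data'].sum() for key, group in groups.items()}
# combine: assemble the per-group results into one Series
result = pd.Series(sums)   # A -> 4, B -> 6
```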

Here is a small example in pandas syntax:


In [14]:
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df


Out[14]:
   data key
0     0   A
1     5   B
2    10   C
3     5   A
4    10   B
5    15   C
6    10   A
7    15   B
8    20   C

In [15]:
df.groupby('key').aggregate('sum')  # np.sum


Out[15]:
     data
key
A      15
B      30
C      45

In [16]:
df.groupby('key').sum()


Out[16]:
     data
key
A      15
B      30
C      45

You can also simply count the members of each split:


In [17]:
df.groupby('key').size()


Out[17]:
key
A    3
B    3
C    3
dtype: int64
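Besides a single reduction, `agg` also accepts a list of functions, producing one result column per function. A short sketch on the same kind of frame:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                   'data': [1, 2, 3, 4]})

# one result column per aggregation: group size, sum and maximum
stats = df.groupby('key')['data'].agg(['size', 'sum', 'max'])
```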

Movie database

These exercises are based on the PyCon tutorial of Brandon Rhodes (so all credit to him!) and the datasets he prepared for it. You can download the data from here: titles.csv and cast.csv, and put them in the data/ folder.


In [18]:
cast = pd.read_csv('data/cast.csv')
cast[10:15]


Out[18]:
                title  year                        name   type                                character    n
10         Mixing Nia  1998  Michael 'babeepower' Viera  actor                                   Rapper  NaN
11   The Replacements  2000           Steven 'Bear'Boyd  actor  Defensive Tackle - Washington Sentinels  NaN
12        Dysfunktion  2016       Kirlew 'bliss' Vilbon  actor                                    Bliss  NaN
13    My Song for You  2010      George 'Bootsy' Thomas  actor                        Cooley's Customer   16
14    My Song for You  2010      George 'Bootsy' Thomas  actor                        Celebration Guest   16

In [19]:
titles = pd.read_csv('data/titles.csv')
titles.head()


Out[19]:
                     title  year
0           The Rising Son  1990
1        Ashes of Kukulcan  2016
2  The Thousand Plane Raid  1969
3         Crucea de piatra  1993
4                   The 86  2015
EXERCISE: Using groupby(), plot the number of films that have been released each year in the history of cinema.

In [ ]:

EXERCISE: Use groupby() to determine how many roles are listed for each of The Pink Panther movies.

In [ ]:

Custom grouping criteria

You can also group by the values of another array, provided that the array has as many elements as the data frame has rows:


In [22]:
greek = ['α', 'β', 'β', 'β', 'β', 'α', 'β','α', 'α']
df.groupby(greek).max()


Out[22]:
   data key
α    20   C
β    10   C

The values of the grouping array can also be computed from values in the data frame. For example, to count odd and even numbers in the data column we can simply do:


In [23]:
df.groupby(df['data'] % 2).size()


Out[23]:
data
0    5
1    4
dtype: int64
EXERCISE: Using groupby(), plot the number of films that have been released each **decade** in the history of cinema.

In [ ]:

EXERCISE: Use groupby() to plot the number of "Hamlet" films made each decade.

In [ ]:

Multiple groups

Note that you can also group by multiple keys:


In [26]:
df['type'] = np.where(df['data'] % 2, 'odd', 'even')
print(df)


   data key  type
0     0   A  even
1     5   B   odd
2    10   C  even
3     5   A   odd
4    10   B  even
5    15   C   odd
6    10   A  even
7    15   B   odd
8    20   C  even

In [27]:
df.groupby(['type', 'key']).sum()


Out[27]:
          data
type key
even A      10
     B      10
     C      30
odd  A       5
     B      20
     C      15


Note that it creates a hierarchical index. More on that later.
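As a preview, two common ways to flatten such a hierarchical index are `unstack`, which moves an index level into the columns, and `reset_index`, which turns the index levels back into ordinary columns. A sketch on a frame with the same shape as above:

```python
import pandas as pd

df = pd.DataFrame({'type': ['even', 'odd', 'even', 'odd'],
                   'key':  ['A', 'A', 'B', 'B'],
                   'data': [1, 2, 3, 4]})
grouped = df.groupby(['type', 'key']).sum()

# move the inner index level ('key') into the columns
wide = grouped.unstack('key')
# or turn both index levels back into ordinary columns
flat = grouped.reset_index()
```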

EXERCISE: List each of the characters that Frank Oz has portrayed at least twice.

In [ ]:

EXERCISE: List, in order by year, each of the films in which Frank Oz has played more than 1 role.

In [ ]:

EXERCISE: How many leading (n=1) roles were available to actors, and how many to actresses, in each year of the 1950s?

In [ ]:

Value counts

A useful shortcut to calculate the number of occurrences of certain values is value_counts (this is roughly equivalent to df.groupby(key).size()).

For example, what are the most common movie titles?


In [32]:
titles.title.value_counts().head()


Out[32]:
Hamlet                  19
Carmen                  14
Macbeth                 14
The Three Musketeers    12
Maya                    11
Name: title, dtype: int64
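The equivalence to `groupby(...).size()` can be checked directly; the main difference is that `value_counts` sorts by count, descending, while size() returns results ordered by label. A sketch on a toy series:

```python
import pandas as pd

s = pd.Series(['Hamlet', 'Carmen', 'Hamlet', 'Maya', 'Hamlet'])

counts = s.value_counts()     # sorted by count, descending
sizes = s.groupby(s).size()   # same numbers, ordered by label
```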
EXERCISE: What are the 11 most common character names in movie history?

In [ ]:

Custom aggregate functions

An aggregate function can be any function that accepts a Series object.

For example, let's find the most frequently appearing name in each year of the last decade:


In [34]:
def most_frequent(x):
    # value_counts sorts by count (descending), so the first index entry is the mode
    return x.value_counts().index[0]

In [35]:
cast.loc[(cast['year'] >= 2010) & (cast['year'] < 2020), ['year', 'name']].groupby('year').agg(most_frequent)


Out[35]:
                   name
year
2010      Lloyd Kaufman
2011           Kyle Rea
2012      Stefan Kramer
2013       Eric Roberts
2014         Omer Pasha
2015       Eric Roberts
2016    Bill Oberst Jr.
2017  David (II) Gattis
2018   Dana (II) Waters
2019     Prashast Singh

Extra exercises

EXERCISE: Which actors or actresses appeared in the most movies in the year 2010?
EXERCISE: Plot how many roles Brad Pitt has played in each year of his career.
EXERCISE: What are the 10 most common film titles that start with the words "The Life"?
EXERCISE: How many leading (n=1) roles were available to actors, and how many to actresses, in the 1950s? And in 2000s?

Acknowledgement

© 2015, Stijn Van Hoey and Joris Van den Bossche (mailto:stijnvanhoey@gmail.com, mailto:jorisvandenbossche@gmail.com).

© 2015, modified by Bartosz Teleńczuk (original sources available from https://github.com/jorisvandenbossche/2015-EuroScipy-pandas-tutorial)

Licensed under CC BY 4.0 Creative Commons


